Method and system for performing information extraction and quality control for a knowledgebase

ABSTRACT

The present invention relates to the field of information extraction and storage and more specifically to techniques for extracting information from a plurality of articles in a distributed manner and for storing the extracted information in an information store. An embodiment of the present invention identifies a plurality of articles from which information is to be extracted and a plurality of information extractors for extracting the information from the articles. A database is provided for storing information related to the plurality of articles and the plurality of information extractors. The plurality of articles are assigned to the plurality of information extractors for information extraction. Information extracted by information extractors from the articles is stored in the information store.

CROSS-REFERENCES TO RELATED APPLICATIONS

[0001] This application is a continuation-in-part of co-pending U.S. application Ser. No. 09/733,495, entitled “Techniques For Facilitating Information Acquisition and Storage”, filed Dec. 8, 2000, previously assigned to the assignee of the present application, Ingenuity Systems, Inc. The entirety of the earlier filed co-pending patent application is hereby expressly incorporated herein by reference.

COPYRIGHT NOTICE

[0002] A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the xerographic reproduction by anyone of the patent document or the patent disclosure in exactly the form it appears in the U.S. Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

BACKGROUND OF THE INVENTION

[0003] The present invention relates to the field of information extraction and storage and more specifically to techniques for managing a distributed information acquisition and information storage process.

[0004] There has been and will continue to be an explosion in the volume and complexity of information available to information consumers. However, due to the magnitude of disparate information available in the public domain, information consumers are typically able to access, comprehend, and meaningfully use only a very small percentage of the available information. This is primarily because the information is typically buried in articles which may be contained in magazines, journals, papers, newspapers, books, notebooks, etc. or is stored in digital format in information stores such as databases, digital libraries, etc. Unless otherwise stated, the term “article” as used in this application should be construed to include any transcribed or printed information, or information available in digital format, or combinations or portions thereof. The information in an article may include text, graphics, charts, audio information, video information, multimedia information, and other types of information in various formats. An article may be published or unpublished. Since these articles could number in the hundreds and thousands, they cannot all be accessed, read, and understood by an information consumer in a practical timeframe. While several data warehousing techniques have been used to integrate information from various articles, these techniques are not flexible enough to keep up with the proliferation of available information. They also rarely help with the information overload problem. In fact, by aggregating data, these data warehousing techniques often make the information overload problem worse.

[0005] One field that has seen a tremendous explosion of information in the past decade is the life sciences field which has benefited from the exponential growth in the identification and functional characterization of genes in the biological sciences. A decade ago a laboratory notebook was often sufficient for “data warehousing.” A researcher could rely on his or her deep understanding of a handful of genes to make informed decisions regarding his or her research. Today, the influx of information and the blurring of traditional biological research boundaries have outstripped the ability of a researcher to fully assimilate, synthesize, and evaluate research data. The primary impediment for a researcher is not the lack of information; rather it is the large quantity of information and the unstructured format in which it is stored. To evaluate results of large-scale experiments, researchers rely heavily on published research literature to identify the key information that is critical for them to make informed decisions. The vast number of articles, the unstructured format of the information, and the inability of the researchers to query on specific experimental results dictate that the review of the literature may take several days, weeks, or even more of a researcher's time. In addition to being very time intensive, the accumulation of knowledge by the researcher is not easily transferable to other researchers because it is not in an easily accessible format.

[0006] Based on the above, there is a need for techniques which can extract information from the various sources and store it in a format which can be easily accessed or queried by an information consumer. It is also desirable that the techniques be flexible enough to keep pace with the proliferation of information. Further, it is also desirable that the techniques be adaptable to extract and store information related to various domains and fields.

SUMMARY OF THE INVENTION

[0007] The present invention discusses techniques for extracting information from a plurality of articles and for storing the extracted information in an information store. According to an embodiment, the present invention identifies a plurality of articles from which information is to be extracted. The present invention also identifies a plurality of information extractors for extracting information from the plurality of articles. A database is provided for storing information related to the plurality of articles and the plurality of information extractors. According to this embodiment, the present invention assigns the plurality of articles to the plurality of information extractors for information extraction. The present invention receives information extracted by an information extractor from an article assigned to the information extractor. The extracted information is then stored in the information store.

[0008] According to an embodiment of the present invention, the information store is a knowledge base which is configured to store the extracted information according to an ontology. In this embodiment, information may be extracted from articles using a fact-based model.

[0009] According to another embodiment, the present invention enables quality control processing to be performed on the information extracted by the information extractor before the extracted information is stored in the information store. According to this embodiment, the present invention enables a content reviewer to review the extracted information received from the information extractor. The present invention may receive information from the content reviewer identifying errors associated with the extracted information.

[0010] According to an embodiment, the present invention determines, from the information received from the content reviewer, an error count indicating the number of errors in the extracted information received from the information extractor. If the error count is above a threshold error count level, the article may be reassigned to the information extractor for information extraction. If the error count is equal to or below the threshold error count level, the present invention may provide services enabling the content reviewer to change the extracted information received from the information extractor to correct the errors.
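By way of illustration only, the threshold comparison described above may be sketched as follows. The names `ReviewResult` and `route_after_review`, the returned labels, and the integer threshold are assumptions introduced for this sketch and do not appear in the specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReviewResult:
    """Errors reported by a content reviewer for one article (hypothetical type)."""
    article_id: str
    errors: List[str] = field(default_factory=list)


def route_after_review(review: ReviewResult, threshold: int) -> str:
    """Decide the next step from the reviewer's error count.

    Above the threshold: the article goes back to the information extractor.
    At or below the threshold: the content reviewer corrects the errors.
    """
    error_count = len(review.errors)
    if error_count > threshold:
        return "reassign_to_extractor"
    return "reviewer_corrects"
```

An error count exactly equal to the threshold is routed to the content reviewer, matching the "equal to or below" wording above.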

[0011] According to another embodiment, the present invention calculates the compensation due to information extractors for extracting information from the articles. The compensation amount for an information extractor may be calculated based on several criteria such as the number of errors in the information extracted by the information extractor, a quality score assigned to the article, and other metrics information captured during quality control processing.
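One possible way of combining these criteria is sketched below. The specification does not state a formula; the linear form, the parameter names `base_rate` and `penalty_per_error`, and the floor at zero are all assumptions made purely for illustration.

```python
def compensation(base_rate: float, quality_score: float,
                 error_count: int, penalty_per_error: float) -> float:
    """Hypothetical compensation rule, not taken from the specification.

    The base rate is scaled by the article's quality score, a fixed
    penalty is deducted per error, and the result never goes below zero.
    """
    amount = base_rate * quality_score - penalty_per_error * error_count
    return max(0.0, amount)
```

Other metrics captured during quality control processing could enter such a rule as additional terms or multipliers.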

[0012] According to yet another embodiment, the information store is configured to store the extracted information according to an information model. In this embodiment, the present invention allows reviewers to review the extracted information and make changes, if any, to the information model to accommodate the extracted information. In this embodiment, the present invention may allow a reviewer to review the extracted information and new concepts introduced by the extracted information and to provide information identifying changes, if any, to be made to the information model. According to a specific embodiment, the information provided by the reviewer may then be reviewed by a second reviewer. After the second reviewer has approved of the changes, the information model may be changed. In a specific embodiment, the information store is a knowledge base which is configured to store the extracted information according to an ontology. The present invention provides services enabling ontologists to review new concepts and to make changes to the ontology to accommodate the new concepts. Other information models may also be used in conjunction with the present invention.

[0013] Further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the attached drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0014] FIG. 1 is a simplified block diagram of a distributed computer network which may incorporate an embodiment of the present invention;

[0015] FIG. 2 is a simplified block diagram of a computer system which may incorporate an embodiment of the present invention;

[0016] FIG. 3 is a simplified flowchart showing processing performed by an embodiment of the present invention to facilitate information extraction and storage;

[0017] FIG. 4 is a simplified flowchart showing processing performed by an embodiment of the present invention for identifying information extractors;

[0018] FIG. 5 is a simplified flowchart showing quality control processing performed by an embodiment of the present invention;

[0019] FIG. 6 is a simplified flowchart showing processing performed by an embodiment of the present invention for calculating the compensation due to an information extractor;

[0020] FIG. 7 depicts an exemplary web page which may be displayed to the information extractor;

[0021] FIG. 8 is a simplified flowchart showing processing performed by an embodiment of the present invention for reviewing new concepts or terms and making changes to the ontology to accommodate the new concepts or terms; and

[0022] FIGS. 9A-9C depict information which may be stored in a database according to an embodiment of the present invention.

DESCRIPTION OF THE SPECIFIC EMBODIMENTS

[0023] The present invention provides techniques for extracting information or knowledge from a plurality of articles in a distributed manner and for storing the extracted information or knowledge in a structured format which can be accessed or queried by information consumers. Techniques are discussed for managing the process of information extraction and storage. FIG. 1 is a simplified block diagram of a distributed computer network 10 which may incorporate an embodiment of the present invention. Computer network 10 includes a number of computer systems 12, 14-1, 14-2, and 14-3 coupled to a communication network 16 via a plurality of communication links 18. The computer systems include a plurality of client computer systems 14-1, 14-2, and 14-3, and a server computer system 12. Client systems 14 typically request information from a server computer system, which performs processing in response to the client request and provides the requested information to the client systems. For this reason, servers typically have more computing and storage capacity than client systems. However, a particular computer system may act as either a client or a server depending on whether the computer system is requesting or providing information.

[0024] Communication network 16 provides a mechanism for allowing the various components of distributed network 10 to communicate and exchange information with each other. Communication network 16 may itself be comprised of many interconnected computer systems and communication links. Communication links 18 may be hardwire links, optical links, satellite or other wireless communications links, wave propagation links, or any other mechanisms for communication of information. While in one embodiment, communication network 16 is the Internet, in other embodiments, communication network 16 may be any suitable computer network. Distributed computer network 10 depicted in FIG. 1 is merely illustrative of an embodiment incorporating the present invention and does not limit the scope of the invention as recited in the claims. One of ordinary skill in the art would recognize other variations, modifications, and alternatives. For example, more than one server system 12 may be coupled to communication network 16.

[0025] According to the teachings of the present invention, server system 12 is responsible for receiving information extracted from the various articles, for processing the information, and storing it in a format which allows information consumers to query or access the information. The term “server system” as used in this application may refer to a single server system as depicted in FIG. 1, or may refer to one or more server systems distributed within computer network 10. Accordingly, functions or tasks performed by the present invention may be distributed to one or more servers coupled to communication network 16. According to a specific embodiment, the servers may be isolated behind firewalls for security purposes and communication between the servers may be encoded and encrypted.

[0026] According to an embodiment of the present invention, the extracted information may be stored in an information store 15 coupled to server 12. The information store may be a database, a knowledge base, a file server, or any other type of storage mechanism. The term “information store” as used in this application may refer to a single information store or to a plurality of information stores distributed within computer network 10. For example, information store 15 may be locally coupled to server 12 or may be distributed across distributed computer network 10 and accessed by server 12 via communication network 16.

[0027] In a specific embodiment of the present invention, information store 15 is a knowledge base configured to store information according to an ontology. An ontology is a knowledge representation of the real world or some portion of the real world. An ontology is typically comprised of “individuals” which represent single things or elements, “classes” which represent a group of things that share similar properties, “slots” which represent relationships between the things, “facets” which represent detailed information about the slots, “relations” which represent detailed relationships between the aforementioned things, and other information. Relations may include but are not limited to taxonomic relationships and partonomic relationships. An ontology may comprise a plurality of branches based on these relationships.
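The ontology elements named above may be pictured with the following minimal sketch. The type names `OntologyClass` and `Individual`, the representation of slots as a mapping from slot name to facets, the use of a single `parent` link for the taxonomic relation, and the example class names are assumptions of this illustration rather than the data model of the invention.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class OntologyClass:
    """A class: a group of things that share similar properties."""
    name: str
    # Taxonomic ("is-a") relation to a more general class, if any.
    parent: Optional["OntologyClass"] = None
    # Slot name -> facets (detailed information about that slot).
    slots: Dict[str, Dict[str, object]] = field(default_factory=dict)

    def ancestors(self) -> List[str]:
        """Names of all more general classes, nearest first."""
        names = []
        node = self.parent
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names


@dataclass
class Individual:
    """An individual: a single thing that belongs to a class."""
    name: str
    cls: OntologyClass
    slot_values: Dict[str, object] = field(default_factory=dict)


# Hypothetical branch of a life sciences ontology.
molecule = OntologyClass("Molecule")
protein = OntologyClass("Protein", parent=molecule,
                        slots={"binds_to": {"value_type": "Molecule"}})
kinase = OntologyClass("Kinase", parent=protein)
example = Individual("example kinase", kinase, {"binds_to": "ATP"})
```

A partonomic ("part-of") relation could be added as a second link of the same kind; it is omitted here to keep the sketch short.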

[0028] Server system 12 may be configured to perform a plurality of functions according to the teachings of the present invention. These functions are typically performed by software code modules executing on server system 12. The functions may also be performed by hardware modules coupled to server system 12, or by a combination of software and hardware modules. Functions performed by server 12 include facilitating identification of articles from which information is to be extracted, determining information extractors who will be responsible for extracting the information from the articles, certifying the information extractors in techniques of information extraction, assigning articles to the information extractors for information extraction, receiving information extracted by the information extractors from the articles, facilitating performance of quality control activities to ensure the correctness and accuracy of the extracted information, enabling users to change the model for storing the information, storing information in information store 15, and performing other functions according to the teachings of the present invention. Details related to the various functions performed by server system 12 are described below.

[0029] As shown in FIG. 1, a database 13 may be coupled to server 12. Database 13 may be used to store information associated with processing performed by the present invention for extracting information from the articles. The information stored in database 13 may also be used to keep track of the various steps of the information extraction and storage process. For example, the status or progress of any particular step of the information acquisition process can be ascertained from the information stored in database 13. Additionally, information related to the various users of the present invention, and the status of the extracted information as it progresses through the process may also be stored in database 13. The users may also be classified into various groups, and roles and permissions may be assigned to the users based on the groups to which the users belong. Information related to the groups and roles and permissions associated with the groups may also be stored in database 13.
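The group-based assignment of permissions described above may be sketched, under stated assumptions, as a simple lookup. The group names and permission strings below are hypothetical examples chosen for this illustration; the specification does not enumerate them.

```python
from typing import Dict, Iterable, Set

# Hypothetical groups and the permissions attached to each group.
GROUP_PERMISSIONS: Dict[str, Set[str]] = {
    "information_extractor": {"submit_extraction"},
    "content_reviewer": {"review_extraction", "correct_extraction"},
    "ontologist": {"review_concepts", "change_ontology"},
}


def permissions_for(groups: Iterable[str]) -> Set[str]:
    """Union of the permissions of every group the user belongs to."""
    granted: Set[str] = set()
    for group in groups:
        granted |= GROUP_PERMISSIONS.get(group, set())
    return granted


def is_allowed(groups: Iterable[str], action: str) -> bool:
    """True if any of the user's groups grants the requested action."""
    return action in permissions_for(groups)
```

In a deployed system such a mapping would be read from database 13 rather than held in a module-level constant.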

[0030] The term “database 13” as used in this application may refer to a single database or to a plurality of databases distributed within computer network 10. For example, database 13 may be locally coupled to server 12 or may be distributed across computer network 10 and accessed by server 12 via communication network 16. Database 13 may be a relational database, an object-relational database, an object-oriented database, a knowledge base, a flat file, or any other way of storing information. It should be apparent that although FIG. 1 depicts information store 15 and database 13 as two separate entities, in a specific embodiment of the present invention, information store 15 and database 13 may be combined into a single information store or database.

[0031] Client systems 14 may be used to interact with server 12. For example, client systems 14 may be used by information extractors to input information extracted from the articles. Client systems 14 may also be used by users to apply to become information extractors. Once a user has been appointed/designated as an information extractor, the user may use client system 14 to participate in certification and testing activities related to the information extraction process which may be offered by server system 12. Client systems 14 may also be used to participate in quality control and information model review activities provided by modules executing on server system 12.

[0032] FIG. 2 is a simplified block diagram of an exemplary computer system 20 according to an embodiment of the present invention. Computer system 20 typically includes at least one processor 24, which communicates with a number of peripheral devices via bus subsystem 22. These peripheral devices typically include a storage subsystem 32, comprising a memory subsystem 34 and a file storage subsystem 40, user interface input devices 30, user interface output devices 28, and a network interface subsystem 26. The input and output devices allow user interaction with computer system 20. It should be apparent that the user may be a human user, a device, another computer, and the like. Network interface subsystem 26 provides an interface to outside networks, including an interface to communication network 16, and is coupled via communication network 16 to corresponding interface devices in other computer systems.

[0033] User interface input devices 30 may include a keyboard, pointing devices such as a mouse, trackball, touchpad, or graphics tablet, a scanner, a barcode scanner for scanning article barcodes, a touchscreen incorporated into the display, audio input devices such as voice recognition systems, microphones, and other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computer system 20 or onto communication network 16.

[0034] User interface output devices 28 may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem may be a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), or a projection device. The display subsystem may also provide non-visual display such as via audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from computer system 20 to a human or to another machine or computer system.

[0035] Storage subsystem 32 stores the basic programming and data constructs that provide the functionality of the various systems embodying the present invention. For example, the various modules implementing the functionality of the present invention may be stored in storage subsystem 32. These software modules are generally executed by processor(s) 24. In a distributed environment, the software modules may be stored on a plurality of computer systems and executed by processors of the plurality of computer systems. Storage subsystem 32 also provides a repository for storing the various databases storing information according to the present invention. Storage subsystem 32 typically comprises memory subsystem 34 and file storage subsystem 40.

[0036] Memory subsystem 34 typically includes a number of memories including a main random access memory (RAM) 38 for storage of instructions and data during program execution and a read only memory (ROM) 36 in which fixed instructions are stored. File storage subsystem 40 provides persistent (non-volatile) storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a Compact Disc Read Only Memory (CD-ROM) drive, an optical drive, removable media cartridges, and other like storage media. One or more of the drives may be located at remote locations on other connected computers at another site on communication network 16. Information stored according to the teachings of the present invention may also be stored by file storage subsystem 40.

[0037] Bus subsystem 22 provides a mechanism for letting the various components and subsystems of computer system 20 communicate with each other as intended. The various subsystems and components of computer system 20 need not be at the same physical location but may be distributed at various locations within distributed network 10. Although bus subsystem 22 is shown schematically as a single bus, alternative embodiments of the bus subsystem may utilize multiple busses.

[0038] Computer system 20 itself can be of varying types including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, or any other data processing system. Due to the ever-changing nature of computers and networks, the description of computer system 20 depicted in FIG. 2 is intended only as a specific example for purposes of illustrating the preferred embodiment of the present invention. Many other configurations of a computer system are possible having more or fewer components than the computer system depicted in FIG. 2. Client computer systems 14 and server computer systems 12 generally have the same configuration as shown in FIG. 2, with the server systems generally having more storage capacity and computing power than the client systems.

[0039] FIG. 3 is a simplified flowchart 50 showing processing performed by an embodiment of the present invention to facilitate the information extraction and storage process. As shown in FIG. 3, the process comprises a number of steps or stages. Status information related to each of the stages is maintained by server 12. Modules performing processing according to flowchart 50 are also responsible for controlling the flow and distribution of articles and information through the various stages of flowchart 50. Processing is initiated by identifying the articles from which the information is to be extracted (step 56). As previously indicated, the term “article” as used in this application should be construed to include any transcribed or printed information, or information available in digital format, or combinations or portions thereof. The information in an article may include text, graphics, charts, audio information, video information, multimedia information, and other types of information in various formats. An article may be published or unpublished. Further, the term “information” as used in this application should be construed to include content, data, knowledge, and other types of information which may be extracted from the articles.

[0040] Several different techniques may be used to identify the articles. According to a first technique, information 54 identifying the articles from which information is to be extracted may be specifically provided to server 12. According to another technique, user criteria 52, which are to be used by server 12 to search for articles from which information is to be extracted, may be provided to server 12. According to a specific embodiment of the present invention, information 54 and user criteria 52 may be used independently to identify the articles. In alternative embodiments of the present invention, various combinations of information 54 and user criteria 52 may be used to identify the articles.

[0041] The user criteria may be used to characterize the type of articles to be found. Users of the present invention may use user criteria 52 to tailor the search performed by server 12 to identify articles related to a particular domain or field or industry. User criteria 52 may include keywords specific to the domain, names of publications, names of journals, newspaper names, database names, digital libraries, various concepts, names of authors, publication dates, etc. related to the domain, and other like information.

[0042] For example, for the life sciences field, user criteria 52 may include keywords such as names of genes, names of array techniques, names of proteins and amino acids, gene sequences, gene expression profiles, drug names, concepts, experimental methods and techniques, names of publications and journals, publication dates, etc. User criteria 52 may also identify publications such as Nature, Cell, Science, Nature Medicine, Nature Genetics, Proceedings of the National Academy of Sciences (PNAS), Journal of Biological Chemistry, European Molecular Biology Organization (EMBO) publications, Journal of Cell Biology, Genes and Development, Molecular and Cellular Biology, etc. to be included in the search. User criteria 52 may also identify databases, including public and private databases (when permitted), to be searched such as the Medline database, the Genbank database, the SwissProt database, the ProSite database, the Interpro database, the LocusLink database, the Unigene database, and various other databases. Various other types of information related to the life sciences domain may also be included in user criteria 52.

[0043] User criteria 52 provided to server 12 may be stored in database 13 coupled to server 12. Based upon the user criteria, server 12 searches the various resources coupled to distributed network 10 to identify articles which satisfy and are relevant to the user criteria. As previously stated, the resources which are searched by server 12 may include magazine repositories, journals, research papers, newspapers, books, and other material repositories. The resources may also include online databases, digital libraries, data banks, etc. coupled to communication network 16. Server 12 may use various search techniques to identify articles which are relevant to the user criteria. These techniques may include techniques using natural language processing to perform the search(es), techniques using synonyms and word/phrase expansion, and other like techniques. Further, server 12 may perform a single search or a plurality of searches based upon the user criteria or based on results of previous searches.
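As one hedged illustration of a search technique using synonyms, the sketch below expands each keyword with its listed synonyms and then performs a case-insensitive substring match. The function names, the synonym table, and the substring matching rule are assumptions of this example; the specification does not prescribe how expansion or matching is performed.

```python
from typing import Dict, Iterable, List, Set


def expand_terms(keywords: Iterable[str],
                 synonyms: Dict[str, List[str]]) -> Set[str]:
    """Return the lower-cased keywords together with their listed synonyms."""
    expanded: Set[str] = set()
    for keyword in keywords:
        key = keyword.lower()
        expanded.add(key)
        expanded.update(s.lower() for s in synonyms.get(key, []))
    return expanded


def matches(text: str, terms: Set[str]) -> bool:
    """True if any expanded term occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


# Illustrative synonym table supplied by the caller.
SYNONYMS = {"tumor": ["tumour", "neoplasm"]}
TERMS = expand_terms(["Tumor"], SYNONYMS)
```

A production search would more likely rely on tokenization or an index than on raw substring tests; the sketch only shows the expansion step feeding a match.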

[0044] The searches performed by server 12 may yield one or more articles. According to a specific embodiment, the articles identified via the searches may be grouped into categories based on the degree of relevancy of the articles to the user criteria. Server 12 may also filter the articles based upon the degree of relevancy of the articles. For example, an article whose degree of relevancy to the user criteria is below a threshold value may be filtered out by server 12 as part of step 56. The threshold value may be user-configurable. In alternative embodiments, a filter based on natural language processing (NLP) may be used to identify articles which are relevant to the user criteria. The user may also indicate that articles from particular sources are not to be considered for information extraction purposes. Server 12 may then automatically filter out articles from these particular sources. The articles may also be categorized based on other criteria such as the source of the articles, publication dates of the articles, author(s) of the articles, etc. The categorization criteria may be configured by the user of the present invention and provided to server 12. For example, the user may indicate that articles from a particular set of journals are to be grouped into one category. It should be apparent that the filtering and categorization techniques are user configurable.
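The filtering and categorization described above may be sketched as follows. The dictionary keys `source` and `relevancy`, the numeric relevancy scale, and the choice to categorize by source are assumptions made for this example; how a degree of relevancy is computed is outside the scope of the sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Article = Dict[str, object]


def filter_and_categorize(
    articles: Iterable[Article],
    threshold: float,
    excluded_sources: Set[str],
) -> Tuple[Dict[str, List[Article]], List[Article]]:
    """Split articles into kept (grouped by source) and filtered out.

    An article is filtered out when its source is excluded by the user
    or when its relevancy is below the user-configurable threshold.
    """
    kept: Dict[str, List[Article]] = defaultdict(list)
    filtered_out: List[Article] = []
    for article in articles:
        if article["source"] in excluded_sources or article["relevancy"] < threshold:
            filtered_out.append(article)
        else:
            kept[article["source"]].append(article)
    return dict(kept), filtered_out
```

The filtered-out list is returned rather than discarded so that it can be recorded for reference purposes.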

[0045] The output of step 56 comprises a filtered or categorized list of articles, which may include articles explicitly identified by the user and/or articles identified via searches performed by server 12. Information related to these articles is stored in database 13 (step 58). For each article, the stored information may include descriptive information about the article such as the title of the article, the author(s) of the article, the source of the article, the publication date of the article, and other like information related to the article. The stored information may also indicate whether the article was specifically identified by the user or identified via a search, information related to the categorization of the article, etc. Information related to articles which are filtered out in step 56 may also be stored in database 13 for reference purposes. Information related to articles which could not be unambiguously categorized in step 56 may also be stored in database 13. This information allows the non-categorized articles to be manually categorized. Information related to the manual categorization of the articles is also stored in database 13. According to a specific embodiment of the present invention, server 12 assigns a unique article identifier to each article. The article identifier allows a user of the present invention to query or track the status of an article during the information extraction and information storage process.
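A minimal in-memory sketch of the per-article record and unique article identifier is given below. The class names `ArticleRecord` and `ArticleRegistry`, the sequential integer identifiers, and the status strings are hypothetical; in the described system this information would reside in database 13.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ArticleRecord:
    """Descriptive and tracking information stored for one article."""
    article_id: int
    title: str
    authors: List[str]
    source: str
    identified_by: str          # "user" or "search"
    status: str = "identified"  # hypothetical initial status


class ArticleRegistry:
    """Assigns unique identifiers and tracks article status."""

    def __init__(self) -> None:
        self._next_id = itertools.count(1)
        self._records: Dict[int, ArticleRecord] = {}

    def register(self, title: str, authors: List[str],
                 source: str, identified_by: str) -> int:
        """Store a new record and return its unique article identifier."""
        article_id = next(self._next_id)
        self._records[article_id] = ArticleRecord(
            article_id, title, authors, source, identified_by)
        return article_id

    def set_status(self, article_id: int, status: str) -> None:
        self._records[article_id].status = status

    def status(self, article_id: int) -> str:
        """Query the current status of an article by its identifier."""
        return self._records[article_id].status
```

Any identifier scheme that guarantees uniqueness would serve equally well; a counter is used here only for brevity.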

[0046] As part of step 58, server 12 also stores (in database 13) access information for each article which enables information extractors to access the article in order to extract information from the article. According to an embodiment, this information may include the title of the article, the author(s) of the article, the source of the article, etc. An information extractor may then use this information to access the article. According to another embodiment, server 12 may store uniform resource locator (URL) information for the article indicating a web site from which the article may be accessed by an information extractor.

[0047] According to yet another embodiment of the present invention, if permitted, server 12 may procure and store digital copies of the articles as part of step 58. In this embodiment, server 12 determines, from the list of articles identified in step 56, articles which are electronically available (i.e. available in digital format), and those which are not. For articles which are electronically available, server 12, if permitted, automatically accesses the digital versions of the articles. Server 12 may determine if access to the articles is permitted on an article-by-article basis. The present invention may be configured to access various types of digital formats such as PDF format, Postscript format, word processor generated formats, text formats, HTML formats, and several other formats. According to an embodiment, server 12, if permitted, makes digital copies of the articles and stores the copies in database 13. In alternative embodiments of the present invention, the digital copies may be stored by other components depicted in FIG. 1, e.g. the copies may be stored on a file server coupled to communication network 16. If the present invention is not permitted to make digital copies of the articles, server 12 may store information related to the articles which allows information extractors to access the articles. For example, as previously stated, server 12 may store a URL corresponding to the article which may be used to display the article, even if the article is stored on a foreign site. For articles which are not available in digital format, copies of the articles may be obtained manually. The manually obtained copies may then be scanned, if permitted, to produce digital versions of the articles. The digital versions may then be stored, for example, in database 13 or on a file server. As previously stated, if the present invention is not permitted to make digital versions of the articles, server 12 may store information related to the articles which allows information extractors to access the articles.

[0048] After information for the articles has been stored in database 13, server 12 may set the status of the articles in database 13 to indicate that the articles are now ready for information extraction. According to an embodiment of the present invention, processing then continues with step 64 or step 60.

[0049] According to an embodiment of the present invention, the present invention generates an ordered listing (or “queue”) of the articles which have been tagged as ready for information extraction (step 60). The position of an article in the queue determines the order in which the article will be presented to an information extractor for information extraction: an article with a higher ranking in the ordered list will be presented for information extraction before an article with a lower ranking. Ordering the articles in this manner ensures that articles which are deemed “more important,” and hence assigned a higher priority, will be presented for information extraction before articles which are deemed “less important.” This also allows the present invention to make optimal use of information extraction resources. For example, given a finite set of information extractors, the ordered listing ensures that information from the “more important” articles will be extracted before the resources are used to extract information from the “less important” articles. It should be apparent that each article in the queue may be represented by information related to the article, such as a URL corresponding to the article, descriptive information for the article, a digital copy of the article, etc.

[0050] The order of an article in the queue is determined by a priority score generated by server 12 and associated with the article. Articles with higher priorities are assigned higher priority scores and are thus ranked higher up the ordered list than articles with lower priorities. The priority for each article may be calculated based on characteristics of the article and using user-configurable priority calculation techniques/algorithms. For example, an article may be prioritized based on the categorization of the article in step 56. Articles that are more relevant to the user criteria may be assigned higher priorities than articles with lower degrees of relevancy to the user criteria. Server 12 may also prioritize articles based upon prioritization criteria 61 configured by the user of the present invention and stored in database 13. Prioritization criteria 61 may include information related to the sources of articles, i.e. the journal, magazine, or database containing the article, the date of publication of articles, author(s) of the articles, and other like information. For example, articles from specific journals identified by the user as “more important” journals may be assigned a higher priority score than articles from other sources. Information related to priority scores associated with the articles and the subsequent ranking of the articles in the queue is stored in database 13. The priority score associated with an article may be periodically changed by server 12 if the criteria for prioritization change or if the algorithm used for calculating the priority changes. The priority score may be recalculated individually for each article or for a whole collection of articles. This change is dynamically reflected in the ordered listing.
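By way of illustration only, the priority-based ordering described above might be sketched as follows. This is a minimal sketch and not the technique actually used by server 12: the names `Article`, `priority_score`, and `build_queue`, the `relevance` field, and the numeric weights are all hypothetical assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Article:
    article_id: str   # unique article identifier assigned by the server
    source: str       # journal, magazine, or database containing the article
    relevance: float  # assumed result of the step 56 categorization, 0.0 to 1.0


# Hypothetical stand-in for prioritization criteria 61:
# sources the user has marked as "more important".
PREFERRED_SOURCES = {"Journal A"}
SOURCE_BONUS = 0.5  # assumed, user-configurable weight


def priority_score(article: Article) -> float:
    """Combine relevance with a bonus for user-preferred sources."""
    score = article.relevance
    if article.source in PREFERRED_SOURCES:
        score += SOURCE_BONUS
    return score


def build_queue(articles):
    """Return articles ordered from highest to lowest priority score."""
    return sorted(articles, key=priority_score, reverse=True)


articles = [
    Article("a1", "Journal B", 0.9),
    Article("a2", "Journal A", 0.6),
    Article("a3", "Journal B", 0.2),
]
queue = build_queue(articles)
```

If the criteria or weights change, calling `build_queue` again yields the re-ranked listing, which corresponds to the periodic recalculation described above.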

[0051] According to another embodiment of the present invention, instead of prioritizing the articles into a single queue, server 12 may prioritize the articles into multiple queues corresponding to different subjects or areas of discussion. For example, in the life sciences field, server 12 may generate a queue for articles discussing oncology related topics, a queue for articles discussing cardiovascular diseases related topics, a queue for articles discussing topics related to gene function, and so on. Organizing the articles in this manner facilitates assignment of the articles to information extractors with special expertise in a particular area within the domain. For example, an article from the oncology queue may be assigned to an information extractor with expertise in oncology.

[0052] In parallel to identifying the articles, the present invention also performs processing to identify information extractors who will be responsible for extracting the information from the articles (step 62). These information extractors may be human beings who have been selected by users of the present invention to extract information from the articles. In alternative embodiments of the present invention, the information extractors may also be application programs which can be configured to automatically extract information from the articles. The process for facilitating selection of information extractors, according to an embodiment of the present invention, is described below.

[0053] FIG. 4 is a simplified flowchart 90 showing processing performed by server 12 for facilitating identification of information extractors according to step 62 in FIG. 3. The process is generally initiated when server 12 identifies a set of potential candidates for performing information extraction (step 98). The set of candidates is generally selected from a plurality of candidates who have expressed an interest in becoming information extractors.

[0054] The present invention may use several techniques to identify the set of potential candidates. According to a specific embodiment, server 12 may receive information 92 related to candidates who are interested in becoming information extractors. Candidates may provide information 92 to server 12 using client systems 14. In this manner, candidates, irrespective of their geographical locations, can apply to become information extractors. The candidate information may be in the form of a resume or other information about the candidate and may be stored by server 12 in database 13. Server 12 may then be configured to automatically compare the threshold requirements 96 for becoming an information extractor (generally provided by the user of the present invention) with the candidate information to identify a set of candidates whose qualifications equal or exceed the threshold requirements. Several commercial-off-the-shelf (COTS) resume matching products may also be used by the present invention to automatically perform the comparison to identify the set of potential candidates. Threshold qualification information 96 is user configurable.
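The comparison of candidate information against threshold requirements 96 might, purely as an illustration, look like the sketch below. The specification does not state which qualifications are compared, so the fields `degree_level` and `years_experience`, the integer encoding of degrees, and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    degree_level: int        # assumed encoding: 1 = bachelor's, 2 = master's, 3 = doctorate
    years_experience: float


@dataclass
class ThresholdRequirements:
    # Hypothetical, user-configurable stand-in for threshold requirements 96.
    min_degree_level: int
    min_years_experience: float


def meets_threshold(candidate: Candidate, req: ThresholdRequirements) -> bool:
    """True when every qualification equals or exceeds its threshold."""
    return (
        candidate.degree_level >= req.min_degree_level
        and candidate.years_experience >= req.min_years_experience
    )


def select_candidates(candidates, req):
    """Return the set of potential candidates (step 98) as a list."""
    return [c for c in candidates if meets_threshold(c, req)]


requirements = ThresholdRequirements(min_degree_level=2, min_years_experience=1.0)
applicants = [
    Candidate("Candidate 1", degree_level=3, years_experience=4.0),
    Candidate("Candidate 2", degree_level=1, years_experience=6.0),
    Candidate("Candidate 3", degree_level=2, years_experience=1.0),
]
selected = select_candidates(applicants, requirements)
```

Because the requirements are held in a separate object, a user can change them and re-run the selection without touching the candidate records.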

[0055] According to another embodiment, server 12 may utilize services and information provided by a hiring system or a resume management system to identify the potential list of candidates. For example, server 12 may use a resume management system to query databases on the Internet where candidates have deposited resumes and to receive information 93 identifying candidates who satisfy/meet the minimum requirements for becoming information extractors.

[0056] In alternative embodiments of the present invention, information identifying the set of potential candidates may be specifically provided to server 12 by users of the present invention.

[0057] According to the teachings of the present invention, information related to the set of potential candidates identified in step 98 may be stored in database 13. For example, for each candidate selected in step 98, server 12 stores information related to the candidate in database 13. The stored information may include the name of the candidate, the candidate's contact information, the candidate's academic information, the candidate's work experience, any special expertise of the candidate, and other like information. Server 12 may also assign a unique identifier to each selected candidate to uniquely identify the candidate. The identifier information may be stored in database 13 and may be used to track the status of the candidate. Server 12 may also set access rights for each selected candidate allowing the selected candidate to access online certification modules provided by server 12.

[0058] The selected candidates then undergo a certification process to learn about procedures and protocols for extracting information from the articles (step 100). According to an embodiment of the present invention, server 12 provides online certification modules which may be accessed by the selected candidates via client systems 14. The certification process typically explains the protocols/procedures to be followed by each information extractor for extracting information from the articles. Such protocols ensure that information from a plurality of heterogeneous articles is extracted in a coherent, standard, and homogeneous format. An example of a protocol which may be used for information extraction is described in Appendix A. The certification process may also introduce and explain the use of information extraction tools used by the information extractors for extracting information. According to an embodiment of the present invention, as part of the certification process, each candidate is allowed to use software tools which are used by information extractors for extracting information from the articles.

[0059] A candidate's progress through the certification process may be tracked by server 12 and stored in database 13. For example, after successful completion of a certification module, information stored in database 13 associated with the candidate may be updated to indicate successful completion of the module by the candidate. In this manner, a candidate's progress through the certification process can be easily tracked.

[0060] After server 12 determines that a candidate has successfully completed the certification process (step 102), the candidate is then tagged as being eligible to be tested to determine if the candidate has acquired sufficient skills to qualify as an information extractor. According to an embodiment of the present invention, information stored in database 13 associated with the candidate is updated to indicate that the candidate has successfully completed the certification process and is ready to be tested. Access rights associated with the candidate are updated to allow the candidate to participate in online testing.

[0061] Several different testing techniques may be used. According to a first technique, a candidate may be deemed to have passed the test upon successful completion of the certification modules and associated practice exercises. According to another technique, the candidate may be required to take an online test (step 104) provided by server 12, and appointment of the candidate as an information extractor may be contingent on the results of the test. After server 12 determines that a candidate has successfully passed the test (step 106), the candidate is then certified and designated as an information extractor (step 108). If a candidate fails the test, the candidate may be allowed to retake the test (step 104) or may be disqualified from becoming an information extractor (step 107). In alternative embodiments of the present invention, the certification and testing activities may also be performed in an offline environment. However, performing the activities in an online distributed manner allows the present invention to harness the power of communication networks such as the Internet to expand the reach of the information extraction process.

[0062] According to an embodiment of the present invention, information stored in database 13 for a candidate is updated to indicate that the candidate has successfully completed the testing process and has been designated as an information extractor. According to an embodiment of the present invention, as part of step 108, the candidate may be asked to enter into contractual agreements with the user of the invention. These contractual agreements may contain terms related to non-disclosure clauses, terms related to the information extractor's compensation, and other terms. In a specific embodiment, the information extractor is paid for extracting information on a per article basis. According to an embodiment of the present invention, the contractual process can be accomplished online using features such as digital signatures, and the like. Information related to the contract signed by the information extractor is stored in database 13. Access rights associated with the candidate are updated to allow the information extractor to gain access to articles marked for information extraction.

[0063] Referring back to FIG. 3, after the information extractors have been identified in step 62, the articles tagged for information extraction are then assigned to the information extractors for information extraction (step 64). One or more articles may be assigned to each information extractor for information extraction. An article may also be simultaneously assigned to more than one information extractor. Assigning an article to more than one information extractor enables redundant information acquisition.

[0064] Several different techniques may be used for assigning articles to the information extractors. According to an embodiment of the present invention in which the articles which are ready for information extraction are not queued by server 12 (i.e. step 60 is not performed), the articles may be assigned to the information extractors in a pre-configured or random manner. Alternatively, an information extractor may be allowed to select an article for information extraction.

[0065] In an embodiment of the present invention in which server 12 prioritizes the articles into a queue, the articles may be assigned to the information extractors in order starting with the first article in the queue. As previously stated, this ensures that articles which are “more important” will be presented for information extraction before articles which are deemed “less important,” thus making optimal use of the information extraction resources.

[0066] According to another embodiment of the present invention, server 12 may create a queue for each information extractor and the articles from the queue generated in step 60 may be assigned to each information extractor's queue. Server 12 may periodically prioritize the articles in the main queue and in the individual information extractor queues. The information extractors may also be organized into groups with a queue for each group. Articles from the queue generated in step 60 may then be assigned to the group queues.

[0067] According to yet another embodiment, server 12 may assign articles based on the expertise of the information extractor. For example, in the embodiment wherein server 12 prioritizes the articles into multiple queues based on the topic of discussion of the articles, server 12 may assign articles to an information extractor from a queue which stores articles related to the field of expertise of the information extractor. For example, articles from the oncology queue may be assigned to an information extractor with expertise in the field of oncology.
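As an illustrative sketch only, expertise-based assignment from multiple topic queues could be expressed as below. The function name `assign_by_expertise`, the dictionary-of-queues layout, and the article identifiers are hypothetical; each queue is assumed to be already ordered by priority.

```python
from collections import deque


def assign_by_expertise(topic_queues, extractor_expertise):
    """Take the next article from the first non-empty queue that matches
    one of the extractor's fields of expertise.

    Returns a (topic, article_id) tuple, or None if no matching article exists.
    """
    for topic in extractor_expertise:
        queue = topic_queues.get(topic)
        if queue:  # queue exists and is not empty
            return topic, queue.popleft()
    return None


# Hypothetical topic queues; the leftmost entry has the highest priority.
topic_queues = {
    "oncology": deque(["art-1", "art-2"]),
    "cardiovascular": deque(["art-3"]),
}

first = assign_by_expertise(topic_queues, ["oncology"])
second = assign_by_expertise(topic_queues, ["gene function"])
```

In this sketch an extractor whose expertise matches no queue simply receives nothing; an embodiment could instead fall back to a general queue.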

[0068] The information in database 13 for each assigned article may be updated to indicate that the article has been assigned to an information extractor for information extraction. The information stored in database 13 for each assigned article may comprise information identifying the information extractor to whom the article was assigned, the date when the article was assigned to the information extractor, and other like information. Likewise, information stored in database 13 for an information extractor may also be updated to indicate that articles have been assigned to the information extractor for information extraction. For each information extractor the stored information may indicate the number of articles assigned to the information extractor, information identifying the assigned articles, the dates when the articles were assigned, and other like information.

[0069] Server 12 then receives information extracted by the information extractors from articles assigned to the information extractors (step 66). Information extractors may input the extracted information using client systems 14. As previously stated, information extractors may access the articles using information stored in database 13. For example, an information extractor may use URL information for an article to access the article. In another embodiment, the information extractor may use descriptive information related to an article to access a hard copy of the article. In embodiments where database 13 stores digital versions of the articles, an information extractor, when permitted, may access the stored digital version of the article using client system 14. After accessing an article, the information extractor extracts information from the article and inputs the extracted information to server 12. The information may be extracted according to a protocol established by the user of the present invention (such as the protocol described in Appendix A).

[0070] According to an embodiment of the present invention, server 12 may provide user interfaces and services to facilitate entry of the extracted information. These user interfaces and services may be accessed by an information extractor using client system 14. Server 12 may provide several techniques allowing the information extractors to input the extracted information. According to a first technique, the information extractor may enter the extracted information in the form of natural language sentences. According to another technique, server 12 may provide templates for entering the extracted information. According to yet another technique, server 12 may provide features allowing information extractors to input the extracted information via pictures or diagrams, speech, fax, e-mail, or handwriting. Server 12 may also allow/enable information extractors to input the extracted information using combinations of the aforementioned techniques and other techniques. Server 12 may then process the information entered by the information extractor to determine information to be stored in information store 15.

[0071] For example, according to an embodiment of the present invention, information store 15 may be a frame-based knowledge base and the protocol for extracting the information may be based on a fact model, e.g. the protocol described in Appendix A. In this embodiment, the extracted information input by an information extractor may comprise one or more facts and information associated with the facts. A fact (or “finding”) may refer to a piece of information having a defined structure and which is extracted from the articles according to a protocol/procedure. A fact may be comprised of discrete objects and processes. The discrete objects may represent physical things, temporal things, abstract things, etc. For example, in the life sciences field, the discrete objects may be genes, proteins, cells, organisms, etc. Processes are actions that act on targets which are also discrete objects, or on other processes. The information extractor may also input metadata for each fact. Metadata is generally information that describes the circumstances under which a fact was observed, but may also include information about the source of the information, for example the authors and publication date of an article. An example of a fact is:

“ . . . GST-bax binds to bcl2 . . . ”

[0072] The fact shown above comprises two discrete objects, namely “GST-bax” and “bcl2.” The metadata for the fact may indicate that “the experiment was performed with human bcl2 expressed and purified from CHO cells and recombinant GST fusions of human bax and bad in GST pulldown assays.” Additional information associated with the facts may also be inputted by the information extractor. Please refer to Appendix A for further details related to the type of information which may be entered by an information extractor according to an embodiment of the present invention. It should be apparent that the present invention is not restricted to fact-based information extraction models. Several other types of information extraction models may also be used according to the present invention.
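To make the fact model more concrete, the example fact above might be represented by a structure such as the following. This is only a sketch under stated assumptions: the class name `Fact` and the field names `objects`, `process`, and `metadata` are hypothetical and are not taken from the protocol of Appendix A.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    objects: list   # discrete objects taking part in the fact
    process: str    # the action relating the objects
    metadata: dict = field(default_factory=dict)  # circumstances and source information


# The example fact from the text, with its metadata stored verbatim.
fact = Fact(
    objects=["GST-bax", "bcl2"],
    process="binds",
    metadata={
        "experiment": (
            "the experiment was performed with human bcl2 expressed and "
            "purified from CHO cells and recombinant GST fusions of human "
            "bax and bad in GST pulldown assays"
        )
    },
)
```

A structured record of this kind is what allows objects and processes to be identified separately from the surrounding sentence and related to one another in a knowledge base.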

[0073] In the fact-based information extraction embodiment described above, the information extractor may input this information using natural language sentences, via user interface templates provided by server 12, using APIs provided by server 12, via diagrams or pictures, speech, fax, e-mail, or handwriting, or using any combinations of the aforementioned techniques and other techniques. Server 12 may be configured to parse the natural language sentences or templates, to identify facts and metadata, to identify objects and processes from the facts, and to determine ontological relationships between the objects and processes, and store the extracted information in the knowledge base.

[0074] While an information extractor is inputting information for a particular article, the information stored in database 13 for the article is updated by server 12 to indicate that the article is currently undergoing information extraction. After server 12 receives a signal from the information extractor indicating that information extraction for an article has been completed, the status information related to the article in database 13 is updated to indicate that information extraction for the article has been completed and that the article is now ready for the quality control process (step 67).
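The status updates described so far can be viewed as a small state machine. The sketch below is an illustrative assumption rather than the disclosed implementation: the `Status` names, the `ALLOWED` transition table, and the `advance` function are hypothetical, and only the statuses mentioned up to this point are modeled.

```python
from enum import Enum


class Status(Enum):
    READY_FOR_EXTRACTION = "ready for information extraction"
    ASSIGNED = "assigned to an information extractor"
    IN_EXTRACTION = "undergoing information extraction"
    READY_FOR_QC = "ready for quality control"


# Assumed transitions: each status may only move forward to the listed statuses.
ALLOWED = {
    Status.READY_FOR_EXTRACTION: {Status.ASSIGNED},
    Status.ASSIGNED: {Status.IN_EXTRACTION},
    Status.IN_EXTRACTION: {Status.READY_FOR_QC},
    Status.READY_FOR_QC: set(),
}


def advance(current: Status, new: Status) -> Status:
    """Return the new status, or raise ValueError for a disallowed transition."""
    if new not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.name} to {new.name}")
    return new


status = Status.READY_FOR_EXTRACTION
status = advance(status, Status.ASSIGNED)
status = advance(status, Status.IN_EXTRACTION)
status = advance(status, Status.READY_FOR_QC)
```

Storing the current status with the article's unique identifier is what lets a user query where an article is in the process at any time.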

[0075] Server 12 may also allow an information extractor to provide comments related to an article. For example, if an information extractor experiences any problems in extracting information for an article, server 12 allows the information extractor to provide details related to the problem which are stored in database 13. These comments provide useful information which may be used for later processing of the article. For example, the comments may indicate deficiencies with the existing model for storing the extracted information, deficiencies in the criteria for selecting articles, etc. In a specific embodiment of the present invention, where the extracted information is stored in a knowledge base based on an ontology, server 12 may enable the information extractor to indicate or discuss new terms or concepts encountered in the extracted information. Information entered by the information extractor related to new terms or concepts may be used during the “information model review” phase (step 74) described below. The information extractor may also suggest a superclass for each new concept or term. Information input by the information extractor regarding the new terms or concepts may be stored in database 13.

[0076] Server 12 may also provide features allowing information extractors to access online help services. For example, server 12 may provide facilities allowing an information extractor to engage in real-time communication with a human or non-human help system. These help services may be used by an information extractor for several purposes, such as to learn more about the process or protocols for information extraction, to discuss problems which may arise during the information extraction process, and other purposes.

[0077] According to an embodiment of the present invention, as part of step 66, after information extraction has been completed for an article, server 12 automatically records metrics associated with the information extraction process for the article. These metrics may include information indicating the total number of facts entered for the article, the time taken by the information extractor to extract the facts, the length of the article, and other like information. The metrics information is associated with the article and stored in database 13. This information may be used for several purposes such as to improve and optimize the performance of the information extraction process, to calculate payments due to the information extractor, to determine the efficiency of the information extractor, to improve information extraction protocols/procedures, and for other purposes.

[0078] As stated above, after an information extractor has finished inputting information for an article according to step 66, the status of the article stored in database 13 is changed to indicate that the article is ready for quality control processing (step 67). The article is then automatically queued to undergo quality control processing. Upon entering the quality control stage, information related to the article stored in database 13 is updated by server 12 to indicate that the article is in the quality control processing stage. Quality control processing (step 68) is geared towards improving the accuracy of the data entered by the information extractors, ensuring that the information has been extracted according to protocols/procedures established by users of the present invention, identifying and correcting errors in the input data, determining error count per article, and performing other activities to improve the overall quality and efficiency of the information extraction process. In general, quality control processing ensures the accuracy and completeness of information being stored in information store 15.

[0079] FIG. 5 is a simplified flowchart 120 showing quality control processing performed by an embodiment of the present invention as part of step 68 in FIG. 3. Quality control processing is generally initiated when an article, which has been tagged as ready for quality control, is assigned by server 12 to a content reviewer (step 122). An article may also be simultaneously assigned to more than one content reviewer. Assigning an article to more than one content reviewer enables redundant quality control processing. A content reviewer may be any human being or application program which is configured to perform quality control processing on the information input by the information extractor. A content reviewer may use client system 14 to view the article, to view information input by the information extractor for the article, and to provide feedback to server 12 regarding the input information. Server 12 provides various features to facilitate quality control processing. For example, user interfaces may be provided which allow a content reviewer to review the information extracted for an article. In an embodiment where the information extractor has inputted the extracted information in the form of facts, upon selection of an article by the content reviewer, facts entered by the information extractor for the article may be displayed to the content reviewer.

[0080] As information extractors develop expertise in the extraction of information from articles and the proper structuring of that extracted information for insertion into information store 15, they may reach a level of expertise sufficient to allow them additionally to perform the functions of content reviewers. Determination of when an information extractor reaches the requisite skill level to perform as a content reviewer can be based on any single criterion or several criteria. Completing an on-line training module, as well as an appropriate examination, can establish eligibility for the content reviewer position. Exceptional scores on any of the relevant metrics described herein for the information extractors for a predetermined number of articles can also establish an information extractor's ability to assume the responsibilities of a content reviewer. In short, information extractors who perform that role in an exemplary fashion may be either automatically shifted to a content reviewer's job or invited to qualify for that position.

[0081] Using the various features provided by server 12, the content reviewer determines and indicates to server 12 whether the article contains any extractable content (step 123). If the input received from the content reviewer indicates that there is no extractable content in the article, the article is tagged accordingly and queued for future information extraction (step 124). For example, an article may be tagged as not containing extractable content if the information contained in the article is outside the scope of the domain of interest to the user of the invention. The status information related to the article in database 13 is updated to indicate that the article has been queued for future information extraction.

[0082] If the article has extractable content, the content reviewer then assesses the structure and accuracy of the information input by the information extractor and indicates to server 12 if there are any errors in the extracted information input for the article by the information extractor (step 125). The errors may be due to inaccuracies in the extracted information input by the information extractor, due to the information extractor having failed to comply with established procedures/protocols for information extraction, errors of omission on the part of the information extractor, and other errors. If server 12 determines that the error count associated with the article is greater than a pre-configured threshold error value (step 130), server 12 reclassifies the article as “incomplete” (step 132). Information related to the article stored in database 13 is updated by server 12 to indicate the incomplete status of the article. The incomplete article is then reassigned to the information extractor for correction of the errors in the previously extracted information (step 134).

[0083] If the error count is below the threshold error value, server 12 then allows the content reviewer to correct the errors (step 136). According to an embodiment of the present invention, server 12 provides various services and user interfaces which allow the content reviewer to edit the extracted information for an article to correct the errors. For example, in the embodiment where information is extracted in the form of facts, modules executing on server 12 may allow the content reviewer to delete facts, copy facts, edit facts, and perform other like activities. These services and user interfaces may be accessed by the content reviewer using client system 14.
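The threshold decision of steps 130 through 136 can be sketched as a simple routing function. The function name and the returned labels are hypothetical. The text specifies the behavior for error counts greater than and below the threshold; treating a count exactly equal to the threshold as correctable by the content reviewer is an assumption made here.

```python
def route_after_review(error_count: int, threshold: int) -> str:
    """Decide who corrects the errors found during quality control.

    Assumption: a count exactly equal to the threshold is handled by the
    content reviewer, since only counts greater than the threshold are
    described as making the article "incomplete".
    """
    if error_count > threshold:
        # Steps 132 and 134: reclassify and return to the information extractor.
        return "incomplete: reassign to information extractor"
    # Step 136: the content reviewer corrects the errors directly.
    return "content reviewer corrects errors"


decision_many = route_after_review(error_count=12, threshold=5)
decision_few = route_after_review(error_count=2, threshold=5)
```

Keeping the threshold as a parameter mirrors the pre-configured threshold error value, which can be tuned without changing the routing logic.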

[0084] According to an embodiment of the present invention, after errors associated with the article have been corrected by the content reviewer (step 138), server 12 then automatically records metrics related to the quality control processing for the article (step 140). The metrics information recorded by server 12 may include the number of edits made by the content reviewer, the time taken for the quality control process for the article, the error count for the article, the type of errors encountered by the content reviewer, and other like information. The metrics information is associated with the article and stored in database 13.

[0085] Individuals qualified as both information extractors and content reviewers allow for overall improvements in the efficiency with which information is extracted and entered into information store 15. Such dual-qualified individuals can perform either information extraction or content review. As the backlogs of articles requiring either information extraction or content review change constantly, the administrators of the knowledge acquisition process can assign and re-assign these dual-qualified individuals on an on-going, real-time basis to ensure that an optimal system throughput is maintained. Alternatively, the process of assigning these dual-qualified individuals can be fully automated, with these individuals first performing quality control processing on articles in the quality control queue and only then performing information extraction on pending articles.
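
By way of illustration only, the fully automated alternative described above might be sketched as follows in Python. The function name `next_task` and the use of in-memory queues of article identifiers are assumptions made for this sketch and are not part of the described embodiment; the sketch simply serves the quality control queue before offering extraction work.

```python
from collections import deque
from typing import Deque, Optional, Tuple


def next_task(qc_queue: Deque[str],
              extraction_queue: Deque[str]) -> Optional[Tuple[str, str]]:
    """Pick the next article for a dual-qualified individual.

    Quality control work is served first; extraction work is offered
    only when the quality control queue is empty. Queue contents are
    hypothetical article identifiers.
    """
    if qc_queue:
        return ("quality_control", qc_queue.popleft())
    if extraction_queue:
        return ("extraction", extraction_queue.popleft())
    return None  # nothing pending in either queue
```

An administrator-driven assignment, as in the first alternative, would replace the fixed ordering with a manual choice between the two queues.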

[0086] Based on the quality control metrics information, server 12 computes a quality control score for the article which is stored in database 13. For example, in an embodiment of the present invention where the extracted information is stored in a knowledge base and uses a fact-based information retrieval protocol, the quality control score (QC) for an article may be calculated according to the following equation:

$$QC = \frac{\left[ 0.25 \times \left( FE + FM + ME + MM \right) + MF + 0.5 \times EF \right] \times 100}{\text{Total Facts (post quality control)}}$$

[0087] wherein,

[0088] FE is the number of fact data errors. These are errors in the fact data input by the information extractor for the article;

[0089] FM is the number of missing fact data errors. These are errors of omission in which an information extractor fails to input required fact information for the article;

[0090] ME is the number of metadata errors. These are errors in the metadata input by the information extractor for the article;

[0091] MM is the number of missing metadata errors. These are errors of omission in the metadata information input by the information extractor for the article;

[0092] MF is the number of facts missing from the information input by the information extractor for the article;

[0093] EF is the number of extraneous facts input by the information extractor for the article. Extraneous facts are generally facts entered by the information extractor which do not qualify as facts according to the information extraction protocol; and

[0094] Total Facts is the total number of facts for the article determined after the quality control process.

[0095] According to the formula shown above, a low QC score indicates high quality (ideally, if there are no errors, QC=0). It should be apparent that various other formulae and variables may be used in alternative embodiments of the present invention.
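
As a minimal sketch of the equation above, the QC score might be computed as follows in Python. The function name `qc_score` and the rejection of a non-positive fact total are assumptions of this sketch; the weights are those given in the equation.

```python
def qc_score(fe: int, fm: int, me: int, mm: int,
             mf: int, ef: int, total_facts: int) -> float:
    """Quality control score per the equation above (lower is better).

    fe, fm, me, mm: fact/metadata errors and omissions (weight 0.25 each)
    mf: missing facts (weight 1.0)
    ef: extraneous facts (weight 0.5)
    total_facts: fact count after quality control
    """
    if total_facts <= 0:
        # Assumption: the score is undefined for an article with no facts.
        raise ValueError("total_facts must be positive")
    weighted_errors = 0.25 * (fe + fm + me + mm) + mf + 0.5 * ef
    return weighted_errors * 100 / total_facts
```

For example, an article with FE=2, FM=1, ME=1, MM=0, MF=1, EF=2 and 20 total facts has a weighted error sum of 1 + 1 + 1 = 3, giving QC = 3 × 100 / 20 = 15.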

[0096] It is anticipated that the skill level of dual-qualified information extractors/content reviewers will be sufficient that articles from which they have extracted information will not need quality control, but will rather be forwarded directly to ontologists, who will then determine how to incorporate the extracted information into information store 15 (see the discussion related to FIG. 8 below).

[0097] The metrics information recorded by server 12 may also be used to generate reports related to the information extraction process. These reports may be generated on a periodic basis. The status of the article in database 13 is then updated to indicate that quality control for the article has been completed (step 142). The article is then queued up for the next processing step. According to an embodiment of the present invention, server 12 updates information associated with the information extractor in database 13 to indicate that the information extractor is eligible to be paid for the article (step 144).

[0098] Referring back to FIG. 3, after an article has successfully passed through the quality control step 68, the information extractor is compensated for extracting information for the article (step 70). This process may be automatically triggered when information stored in database 13 for the information extractor is updated by server 12 to indicate that the information extractor is eligible for receiving compensation for the article. Alternatively, the process may be automatically triggered when the status of an article is updated to indicate that quality control processing for the article has been completed. The process may also be triggered by the information extractor after the information extractor queries database 13 and determines that the article has completed the quality control process. Several different techniques may be used to compensate the information extractor. For example, the information extractor may be monetarily compensated, or may be compensated using other techniques such as points, stock options, etc.

[0099] According to an embodiment of the present invention, server 12 determines the payment due to the information extractor based on the quality of work performed by the information extractor, which may be based on several factors such as the quality control score associated with the article, whether or not the article was reassigned for information extraction, the error count associated with the information input by the information extractor, and other like information. Information regarding the compensation payable to the information extractor is stored in database 13.

[0100] FIG. 6 is a simplified flowchart 160 showing processing performed by an embodiment of the present invention for automatically calculating the compensation due to an information extractor. This embodiment assumes that the information has been extracted using a fact-based information retrieval model. According to the embodiment depicted in FIG. 6, server 12 first determines a base rate (BR) of payment for the article (step 162). This base rate is generally stored in database 13. Server 12 then determines if the article was ever reassigned to the information extractor for corrections (step 164). If it is determined that the article was never reassigned, processing continues with step 171. If the article was reassigned, server 12 then determines the number of times that the article was reassigned (step 166). If the number of times that the article was reassigned is above a threshold value, server 12 may indicate that the information extractor is not entitled to compensation for the article (step 168). Information to this effect may be stored in database 13. If the number of times that the article was reassigned is equal to or below the threshold value, a new base rate may be calculated by multiplying the current base rate by 90% (step 170). Processing then continues with step 171.

[0101] In step 171, server 12 compares the total number of facts for the article with a user-configurable low fact watermark value. According to a specific embodiment, the low fact watermark value is set to 10. If the fact count for the article is less than or equal to the low fact watermark value, a new base rate is calculated by multiplying the current base rate by 75% (step 172). Processing then continues with step 174. If the fact count for the article is greater than the low fact watermark value, processing continues with step 174. In step 174, server 12 compares the total number of facts for the article with a user-configurable high fact watermark value. According to a specific embodiment, the high fact watermark value is set to 50. If the fact count for the article is greater than the high fact watermark value, a new base rate is calculated by multiplying the current base rate by 125% (step 176). Processing then continues with step 178. If the fact count for the article is less than or equal to the high fact watermark value, processing continues with step 178.

[0102] Server 12 then compares the quality score associated with the article with a user-configurable quality score threshold (step 178). In an embodiment where lower quality scores correspond to better quality, if the quality score associated with the article is less than the quality score threshold, i.e. indicating high quality, a new base rate is calculated by multiplying the current base rate by 120% (step 180). Processing then continues with step 182. If the quality score is greater than or equal to the quality score threshold, processing continues with step 182.

[0103] In step 182, adjustments may be made to the calculated payment rate. For example, adjustments may be made based on the geographical locations of the information extractors, e.g. information extractors located in countries outside the US may be paid a higher or lower rate depending on the prevailing market rates in that country. After the adjustments have been made, the final calculated payment rate indicates the compensation amount due to the information extractor for the article. This information is then stored in database 13 to facilitate payment of the amount to the information extractor (step 184).
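
The processing of steps 162 through 182 may be sketched, purely for illustration, as the following Python function. The names `PaymentConfig` and `compute_payment` are hypothetical, the reassignment and quality score thresholds have no values in the description and must be supplied by the caller, and the single `regional_factor` multiplier is only an assumed stand-in for the adjustments of step 182. The watermark values and percentage multipliers are those of the specific embodiment described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PaymentConfig:
    max_reassignments: int         # threshold of step 166 (value is hypothetical)
    quality_threshold: float       # threshold of step 178 (value is hypothetical)
    low_fact_watermark: int = 10   # from the specific embodiment
    high_fact_watermark: int = 50  # from the specific embodiment
    regional_factor: float = 1.0   # assumed stand-in for step 182 adjustments


def compute_payment(base_rate: float, reassign_count: int, fact_count: int,
                    qc_score: float, cfg: PaymentConfig) -> Optional[float]:
    """Return the payment rate for an article, or None if no payment is due."""
    rate = base_rate
    if reassign_count > 0:                        # step 164
        if reassign_count > cfg.max_reassignments:
            return None                           # step 168: not entitled
        rate *= 0.90                              # step 170
    if fact_count <= cfg.low_fact_watermark:      # step 171
        rate *= 0.75                              # step 172
    if fact_count > cfg.high_fact_watermark:      # step 174
        rate *= 1.25                              # step 176
    if qc_score < cfg.quality_threshold:          # step 178 (lower is better)
        rate *= 1.20                              # step 180
    return rate * cfg.regional_factor             # step 182
```

Under these assumptions, an article with a base rate of 100 that was reassigned once, contains 60 facts, and has a quality score below the threshold would be paid 100 × 0.90 × 1.25 × 1.20 = 135.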

[0104] It should be apparent that the flowchart depicted in FIG. 6 describes processing performed according to a specific embodiment of the present invention. Likewise, the percentage multipliers described above illustrate a particular embodiment of the present invention. Several other techniques and multipliers may be used for calculating compensation due to the information extractor according to other embodiments of the present invention. In terms of compensation, dual-qualified information extractors/content reviewers may be compensated at a rate that is greater than that used to compensate individuals who are qualified only as information extractors or content reviewers, or may be paid at different rates depending on the tasks completed.

[0105] The actual payment of the compensation amount to the information extractor may also be achieved using various techniques. According to a specific embodiment, server 12 may send a message to an accounts payable application instructing the accounts payable application to issue a check to the information extractor for the amount owed. Alternatively, server 12 may itself perform processing to pay the information extractor. For example, the present invention may automatically credit the information extractor's account for the amount due. The present invention may also issue a check to the information extractor for the amount owed. In an alternative embodiment, server 12 may provide interfaces which allow accounts payable personnel to access information stored in database 13. Information regarding the amount paid to the information extractor, when the amount was paid, and other like information may be recorded in database 13.

[0106] Server 12 may also provide user interfaces which allow information extractors to determine the status of the articles for which they have extracted information. For example, a web page may be displayed for each information extractor displaying the status of the various articles for which the information extractor has extracted information. The web page may also display the status of compensation payment for each article. FIG. 7 depicts an exemplary web page 190 which may be displayed to the information extractor by server 12. As shown in FIG. 7, web page 190 may display information 191 related to the information extractor such as the name of the information extractor, the country of residence of the information extractor, and the identification number of the information extractor. As previously stated, the identification number is usually assigned by server 12 to uniquely identify the information extractor. Web page 190 may also display a list of articles 192 assigned to the information extractor for information extraction. Each article may be identified by an article identification number which, as previously stated, may be assigned by server 12. For each article in the list, the status/progress of the article in the information extraction process may be displayed. Web page 190 may also display quality control related metrics such as the “Fact Range,” the quality score calculated for the article, and other like information. The “Fact Range” indicates the number of facts in an article which may be used to determine the information extractor's compensation. For example, if an article has 10 or fewer facts it may be classified as belonging to the “low” fact range and the information extractor gets paid at a lower rate. If the article has 11 to 50 facts, the article may be classified as belonging to the “normal” fact range and the pay rate is adjusted accordingly. If there are 51 or more facts the article may be classified as belonging to the “above normal” fact range and the pay rate is higher. The calculation of the pay rate based on the number of facts in an article has been described above with respect to FIG. 6. Additionally, web page 190 may also display payment related information 193.
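
The “Fact Range” classification described above could, for example, be derived as in the following Python sketch. The function name `classify_fact_range` is an assumption; the default boundaries of 10 and 50 are the watermark values of the specific embodiment and would be user-configurable in practice.

```python
def classify_fact_range(fact_count: int,
                        low_watermark: int = 10,
                        high_watermark: int = 50) -> str:
    """Map an article's fact count to the range label shown on web page 190."""
    if fact_count <= low_watermark:
        return "low"            # 10 or fewer facts
    if fact_count <= high_watermark:
        return "normal"         # 11 to 50 facts
    return "above normal"       # 51 or more facts
```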

[0107] Referring back to FIG. 3, after quality control processing for an article has been completed, the status of the article in database 13 is updated to indicate that the article is now ready for the next processing phase. The article may then be queued up for an “information model review” stage during which model reviewers are allowed to review the information extracted from the article and determine if the model used for storing the information in information store 15 needs to be changed to accommodate the extracted information (step 74). The “information model” for an information store refers to the information representation used to store the information in information store 15. For example, for a knowledge base, the “model” may refer to an ontology used to represent the knowledge in the knowledge base. As stated above, an ontology is typically a representation of the world or a part of the world. For a relational database, the “model” may refer to the table structure used to store information. The model reviewers may be human beings trained to review the extracted information or application programs configured to perform the review.

[0108] Server 12 provides several services and user interfaces which facilitate the model review process and which allow model reviewers to review, change, or update the existing information model structure. Model reviewers may perform these activities using client systems 14 coupled to server 12 via communication network 16. For example, if the information is stored in a knowledge base according to an ontology, the model reviewers (or ontologists) can review new terms or concepts that are introduced in the information extracted from the articles and make appropriate changes to the ontology.

[0109] FIG. 8 is a simplified flowchart 200 showing processing performed by an embodiment of the present invention during the information model review stage. For the embodiment depicted in FIG. 8, it is assumed that information extraction is based on a fact-based model and the extracted information is stored in a knowledge base based on an ontology. Flowchart 200 depicts processing performed by the embodiment of the present invention for reviewing new concepts or terms and making changes to the ontology to accommodate the new concepts or terms. The process is initiated when server 12 identifies the new concepts associated with the extracted information (step 202). Information for each concept may be stored in database 13. As previously described, information regarding the possible presence of new concepts in the extracted information is generally indicated by the information extractor while inputting the extracted information during step 66 in FIG. 3. For example, the information input by the information extractor may indicate the new concepts for the articles, the suggested superclass for each concept, information describing each concept, etc. Information stored in database 13 for each concept may also include information about the source of the concept, the date when the new concept was input to server 12, and other like information.

[0110] Server 12 then prioritizes the concepts and queues them up for assignment to the ontology reviewers (step 204). According to an embodiment of the present invention, server 12 may prioritize the concepts based upon the same prioritization criteria used for prioritizing the articles. According to another embodiment, concepts which require changes to the ontology may be given a high priority since the ontology needs to be changed before the fact corresponding to the concept can be entered into the knowledge base.

[0111] The new concepts or terms from the queue may then be triaged or assigned to ontologists that are responsible for different branches of the ontology (also called “branch ontologists”) (step 206). Information associated with the concepts in database 13 is updated to identify the branch ontologist to whom the concept was assigned. According to an embodiment of the present invention, the assignment may be automatically driven by the superclass suggested for the new concept. For example, if a new concept like “mouse” comes up, and has a suggested superclass of “mammal” associated with it, the new concept may be automatically assigned by server 12 to the branch ontologist responsible for the “mammals” branch of the ontology.
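
The superclass-driven assignment of step 206 may be illustrated by the following minimal Python sketch. The function name, the lookup table keyed by suggested superclass, the ontologist identifiers, and the fallback to a manual triage queue are all assumptions introduced for illustration; the description does not specify how server 12 maps superclasses to branch ontologists.

```python
from typing import Dict


def assign_branch_ontologist(suggested_superclass: str,
                             branch_owners: Dict[str, str],
                             fallback: str = "manual_triage") -> str:
    """Route a new concept to the ontologist owning its suggested superclass.

    branch_owners maps a superclass name to a (hypothetical) ontologist id.
    Concepts whose superclass has no registered owner go to the fallback.
    """
    key = suggested_superclass.strip().lower()
    return branch_owners.get(key, fallback)
```

For instance, with `branch_owners = {"mammal": "ontologist_mammals"}`, the concept “mouse” with suggested superclass “mammal” would be routed to `ontologist_mammals`.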

[0112] Server 12 then allows the branch ontologist to whom the concept was assigned to indicate if the assignment was correct (step 207). If the concept was erroneously assigned to the branch ontologist or if the branch ontologist prefers to assign the concept to another branch ontologist, server 12 provides services to assign the concept to another branch ontologist. If the concept was correctly assigned, processing continues with step 208.

[0113] Once the triage is done, the primary ontologist to whom a concept is assigned is allowed to review the concept and information related to the concept to determine if the ontology needs to be changed to accommodate the concept. Server 12 may provide several user interfaces and services which facilitate the concept review process. For example, server 12 may provide services for viewing the new concepts, sorting the concepts based on several criteria, viewing the suggested superclasses, adding/deleting new objects, adding/deleting slots, etc. The branch ontologist may use these services and user interfaces to review information related to the concept and to provide concept review information to server 12 (step 208). The concept review information input by the branch ontologist may include classification information for the new concept, information defining or documenting the new concept, and other information. The branch ontologist may also input information for modeling the concept in the ontology.

[0114] After the branch ontologist has indicated that review of a concept has been completed, information associated with the concept in database 13 is updated to indicate that concept review has been completed and that the concept is now awaiting approval from a secondary ontologist. The concept is then assigned to a secondary ontologist (step 210) who reviews the information provided by the primary branch ontologist and checks it for quality. Server 12 may provide user interfaces and services which allow the secondary ontologist to review information input by the primary ontologist and to make changes to the information when necessary. The secondary ontologist provides feedback on the work of the primary ontologist to server 12 (step 212). If the quality of work of the primary ontologist is below a user-configurable acceptable quality threshold (step 214), the concept is returned/reassigned to the primary ontologist for correction (step 216). Information associated with the reassigned concept may indicate errors identified by the secondary ontologist in the information input by the primary branch ontologist. If the quality is above the threshold (i.e. the secondary ontologist has “approved” the new concept), information associated with the concept stored in database 13 is updated to indicate that the concept or term has been approved (step 218). Server 12 keeps track of the changes made to the ontology and the concepts/terms that have been modeled. The information related to the changes may then be stored in database 13 (step 220). After new concepts associated with an article have been reviewed and approved, changes may then be made to the ontology. The facts associated with these concepts are then ready to be stored in information store 15. Status information for the article in database 13 is updated to indicate that information from the article is ready to be stored in information store 15.
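
The decision of step 214 may be sketched as follows in Python. The status names and the numeric quality scale are hypothetical. The description does not say how a quality value exactly equal to the threshold is treated; this sketch assumes it is approved.

```python
from enum import Enum


class ConceptStatus(Enum):
    RETURNED_TO_PRIMARY = "returned to primary ontologist"  # step 216
    APPROVED = "approved"                                   # step 218


def apply_secondary_review(quality: float, threshold: float) -> ConceptStatus:
    """Decide the outcome of the secondary ontologist's review (step 214).

    Here a higher quality value means better work. A value equal to the
    threshold is approved, which is an assumption of this sketch.
    """
    if quality < threshold:
        return ConceptStatus.RETURNED_TO_PRIMARY
    return ConceptStatus.APPROVED
```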

[0115] According to an embodiment of the present invention, the processing depicted in FIG. 8 ensures that the extracted information will not be loaded into the information store 15 until changes to the information model have been proposed, reviewed, and accepted. This ensures that the fact-related information entered in the information store 15 does not violate the information model used for storing the information in information store 15.

[0116] When the information store is a relational database comprising a plurality of tables, the model reviewer determines if the structure of one or more tables or the relationships between the tables need to be changed to accommodate the information entered by the information extractor. Server 12 may provide interfaces and services to facilitate the review and change process. Likewise, server 12 may provide facilities for reviewing and amending the information models for other types of information stores such as object-oriented databases, and the like.

[0117] After server 12 receives an indication from the model reviewer that the model reviewer has completed review of the model for an article, server 12 changes the status of the article in database 13 to indicate completion of the model review phase for the article and to indicate that knowledge extracted from the article is now ready to be deposited in information store 15.

[0118] Referring back to FIG. 3, after model review for an article has been completed, the information extracted from the article is automatically deposited and stored in information store 15 (step 76). As part of step 76, server 12 may process the extracted information and convert it to a format suitable for storage in information store 15. The information is then added to information store 15. For example, in a specific embodiment of the present invention wherein information store 15 is a knowledge base, server 12 may translate the extracted information to a format which is suitable for storing in a knowledge base. Server 12 may check that the frames to which the information is to be added exist. Server 12 may also add slots to the frames and then populate the slots with the extracted information. The translated information may then be stored in the knowledge base.
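
The frame check and slot population of step 76 may be illustrated with the following Python sketch. It models a frame-based knowledge base as a nested dictionary, which is an assumption made only for illustration; the names `Frame` and `store_fact`, and the choice to raise an error for a missing frame, are likewise hypothetical.

```python
from typing import Any, Dict, List

# A frame is modeled here as a mapping from slot name to a list of values.
Frame = Dict[str, List[Any]]


def store_fact(kb: Dict[str, Frame], frame_name: str,
               slot: str, value: Any) -> None:
    """Add one extracted value to a slot of an existing frame."""
    # Check that the target frame exists before adding information to it.
    if frame_name not in kb:
        raise KeyError(f"frame {frame_name!r} does not exist in the knowledge base")
    # Add the slot if the frame does not have it yet, then populate it.
    kb[frame_name].setdefault(slot, []).append(value)
```

Rejecting facts whose frame is missing mirrors the requirement that the information model be updated and approved before extracted information is loaded.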

[0119] As described above, the present invention manages the process of information extraction and storage. It should be apparent that the steps shown in FIG. 3 can be performed concurrently. For example, while an information extractor is entering extracted information for a first article, the present invention may be performing quality control processing on a second article for which the information has already been input, performing model review for a third article, and may be storing information in information store 15 for a fourth article, and so on. Accordingly, the tasks of identifying articles, identifying information extractors, receiving the extracted information, quality control processing, model review, and storage of information can be performed in parallel and in stages.

[0120] As described herein, both the information extraction process and the content review process may be geographically distributed. There is little need for a physical concentration of individuals in one place, as the training material may be provided on a web site accessed through the Internet and the articles selected for information extraction and for content review may also be provided in electronic versions over the Internet. For the task of content review, both the original article and the results of the information extraction may be provided over the Internet as electronic documents. Once this electronic distribution network is established, it can be utilized in several ways to minimize the total costs of populating information store 15. At any given time, content reviewers in several different countries will be available to review articles that have already gone through the information extraction process. As salaries vary from country to country for individuals with equivalent skill sets, it is possible to automatically designate content reviewers who work for a generally lower rate of compensation to receive more work than those paid at a higher rate. A certain minimum amount of content review work should flow to all individuals qualified for such work both to retain the services of these individuals and to keep their skills well honed. Similar work allocation can also occur in the information extraction process, as work can first be distributed to less well-compensated individuals, then to those who are working for a higher compensation level. Again, to retain the services of all qualified information extractors, a certain minimum number of articles should be provided to each qualified information extractor. Alternatively, better-qualified extractors and reviewers may be given the opportunity to select articles for extraction or quality control review. As another alternative, articles may be assigned based on the types of articles the extractor has previously been assigned.
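
One possible reading of the cost-driven allocation described above is sketched below in Python. The function name, the tuple layout, and in particular the per-worker capacity are assumptions not found in the description; the sketch first guarantees each qualified worker a minimum number of articles and then gives the remainder to the lowest-paid workers.

```python
from typing import Dict, List, Tuple

# (name, pay rate, capacity); all three fields are hypothetical.
Worker = Tuple[str, float, int]


def allocate_articles(num_articles: int, workers: List[Worker],
                      minimum: int) -> Dict[str, int]:
    """Distribute articles so cheaper workers get more, with a floor for all."""
    by_rate = sorted(workers, key=lambda w: w[1])  # lowest pay rate first
    allocation = {name: 0 for name, _, _ in by_rate}
    remaining = num_articles
    # First pass: give every worker the minimum, as far as articles allow.
    for name, _, capacity in by_rate:
        share = min(minimum, capacity, remaining)
        allocation[name] += share
        remaining -= share
    # Second pass: hand the rest to the lowest-paid workers with spare capacity.
    for name, _, capacity in by_rate:
        share = min(capacity - allocation[name], remaining)
        allocation[name] += share
        remaining -= share
    return allocation
```

With nine articles, a minimum of one, and three workers of capacity five paid at rates 10, 20, and 30, the workers would receive five, three, and one articles respectively.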

[0121] FIGS. 9A-9C depict information which may be stored in database 13 according to an embodiment of the present invention. In the embodiment depicted in FIGS. 9A-9C, the information is stored in the form of tables with links between the tables. Table Concepts 244 stores information for concepts which may be included in user criteria 52 (see FIG. 3) and used for identifying articles from which information is to be extracted. Information about the terms which may be used to describe the concepts is stored in Table Terms 250. Table ConceptReference 248 stores information which is used to map the terms to the concepts. Information regarding the source and description of the terms is stored in Table TermSource 252 and Table Description 256, respectively. Information related to the various categories used for searching the articles is stored in Table Category 254. Contextual information related to the categories is stored in Table ArcheTypes 246. For example, if a “gene” category was used for the search, Table ArcheTypes 246 may store contextual information about the gene such as the type of the gene, the organismal source of the gene, the chemical structure of the gene, and other like information.

[0122] Tables CMAArticles 240 and CMAJournals 242 store information about articles which are candidates for information extraction. The stored information may include information which allows information extractors to access the article, such as URL information. These tables also store publication date information for the articles, the date when the article was identified, and other descriptive information for the article.

[0123] As previously described, a variety of metrics information is captured at various stages of the processing. Table AMSArticle 258 stores the metrics information for the articles. The stored information may include metrics related to the information extraction process, metrics recorded during the quality control process, information for calculating the quality control score for each article, metrics used for determining the amount of compensation due to information extractors, and other like information.

[0124] Table AMSConcepts 262 stores information about new concepts or terms that need to be modeled in the ontology. The information in Table AMSConceptTranscript 264 is updated by the ontologists during the model review stage, and describes how new concepts are to be modeled in the ontology. Table AMSDocument 260 stores information which is used for converting the extracted information into a format which facilitates storage in the knowledge base. Table AbstractMarkup 266 stores results related to the automatic verification of articles based on the titles and/or the abstracts of the articles. This information may indicate why a particular article was or was not deemed relevant by server 12. This information may be used to manually verify and categorize articles which could not be unambiguously verified and categorized by server 12.

[0125] As described above, queues are used at various stages of processing. Tables QueueItems 268, QueueItemData 270, and QueueItemLog 272 store information related to the queues. Table QueueItems 268 stores information mapping individual items to the queues containing the items. Table QueueItemData 270 stores information which is used for prioritizing the articles in the queues. Table QueueItemLog 272 is used for logging information related to the queue items. It should be apparent that FIGS. 9A-9C describe a specific embodiment of the present invention and do not limit the scope of the present invention as recited in the claims.

[0126] Although specific embodiments of the invention have been described, various modifications, alterations, alternative constructions, and equivalents are also encompassed within the scope of the invention. The described invention is not restricted to operation within certain specific data processing environments, but is free to operate within a plurality of data processing environments. For example, the present invention may be used to extract and store information for any domain or industry which benefits from the information extraction and storage. Additionally, although the present invention has been described using a particular series of transactions and steps, it should be apparent to those skilled in the art that the scope of the present invention is not limited to the described series of transactions and steps.

[0127] Further, while the present invention has been described using a particular combination of hardware and software, it should be recognized that other combinations of hardware and software are also within the scope of the present invention. The present invention may be implemented only in hardware or only in software or using combinations thereof.

[0128] The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that additions, subtractions, deletions, and other modifications and changes may be made thereunto without departing from the broader spirit and scope of the invention as set forth in the claims.

1.-3. (Canceled).
 4. A method for constructing a knowledge representation, the method comprising the steps of: selecting articles to serve as an information source for the knowledge representation; extracting and formatting information contained in the articles for storage in the knowledge representation including representing a fact expressed in an article's natural language as at least an object and process relationship; verifying that the information extracted from the selected articles is correct and that it has been placed in the correct format; and storing the formatted information in the knowledge representation.
 5. The method of claim 4 wherein the extracting information step is performed by knowledge extraction personnel and the verifying step is performed by quality control personnel.
 6. The method of claim 5 wherein both the extracting step and verifying step are performed by the same person, which person has been qualified by a predetermined procedure to perform both steps simultaneously.
 7. The method of claim 4 wherein at least the steps of extracting and verifying occur in geographically separated locations.
 8. The method of claim 7 wherein the geographically separate locations are chosen based upon the cost of performing the respective steps of extracting and verifying, the lowest cost location for each step being selected.
 9. The method of claim 4, wherein the extracting information step includes using a computer-driven parser of natural language.
 10. The method of claim 4, wherein the representing step includes representing an object and process relationship in the form of the process being an action that acts upon the object.
 11. The method of claim 4, wherein the representing step includes representing an object and process relationship in the form of the first object being an effector of the process and the process is an action that acts upon one or more second objects.
 12. A system for extracting information from articles originating from a first database and storing the extracted information in a second database, the system comprising: an information extraction unit which extracts a finding from an article's natural language and translates this finding into a structured finding comprising at least an object, process, and a relationship between the object and process; a database management unit in communication with the information extraction unit for determining if the structured finding has been properly formatted for storage in the second database; an information storage unit in communication with the second database for storing the structured finding in the second database.
 13. The system of claim 12, further comprising a query management and information display unit for responding to user inquiries for information stored in the second database and for retrieving information from the second database in response to those queries.
 14. The system of claim 12, wherein the second database is frame-based.
 15. The system of claim 12, wherein the structured finding is formatted according to a fact-based model.
 16. The system of claim 12, wherein the relationship between the object and process takes the form of the process is an action that acts upon the object.
 17. The system of claim 12, wherein the object is a gene, protein, cell, or organism.
 18. The system of claim 12, wherein the finding is derived from one or more sentences, a portion of a sentence, a diagram, figure or table.
 19. The system of claim 12, wherein the second database includes an ontology.
 20. The system of claim 12, wherein the first database is coupled to, and in communication with the information extraction unit.
 21. The system of claim 12, further including an article selection unit, for selecting articles for information extraction from among a plurality of articles residing in the first database.
 22. The system of claim 12, wherein the article's representation of the finding has a first semantic structure and wherein the translation of the finding includes a translation of the finding into a natural language having a second semantic structure.
 23. The system of claim 12, wherein information is extracted using a user template.
 24. The system of claim 12, wherein information is extracted using a computer-driven parser of the natural language.
 25. The system of claim 12, wherein the structured finding comprises a first object, second object and a process relationship.
 26. The system of claim 25, wherein the second object is an additional process or pathway.
 27. The system of claim 25, wherein the first object is an effector of the process and the process is an action that acts upon the second object and that is mediated by a third object.
 28. The system of claim 25 wherein the first object, second object and process include modifiers.
 29. The system of claim 25, wherein the first object is a first comparison of the property between two objects or processes, the second object is a second comparison of the property between two objects or processes, and the process is a relative comparison between the first and second comparisons.
 30. The system of claim 12, wherein the process is a process modifier and the structured finding further includes a second object that is a process or a pathway.
 31. The system of claim 12, wherein the object and process relationship takes the form of a first object as an effector of the process and the process indicates an action that acts upon one or more second objects.
 32. The system of claim 31, wherein the process indicates a lack of action upon the one or more second objects.
 33. The system of claim 12, wherein the object contains one or more modifiers.
 34. The system of claim 12, wherein the object and process include property annotations indicating one of a cellular, organ, or other physical location.
 35. The system of claim 12, wherein the object is an effector of a plurality of processes and all of these processes are actions that act upon a second object.
 36. The system of claim 12, wherein the article's natural language includes a first and second finding and wherein the first finding comprises the process and object and the object includes the second finding. 
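
Illustrative note (not part of the claims): the structured finding recited in claims 12, 17, 25, 31 and 32 could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the implementation described in the specification. The names Entity, StructuredFinding and is_properly_formatted, the field names, and the example protein names are all hypothetical and chosen only for illustration. The format check is one plausible reading of "properly formatted for storage" and is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    """Hypothetical object in a finding, e.g. a gene, protein, cell, or organism."""
    name: str
    kind: str
    modifiers: List[str] = field(default_factory=list)


@dataclass
class StructuredFinding:
    """Hypothetical structured finding: a process acting upon one or more objects.

    effector is the optional first object that brings about the process;
    negated marks a finding that reports a lack of action.
    """
    process: str
    targets: List[Entity]
    effector: Optional[Entity] = None
    negated: bool = False


def is_properly_formatted(finding: StructuredFinding) -> bool:
    """Assumed format check: a named process and at least one named target object."""
    if not finding.process.strip():
        return False
    if not finding.targets:
        return False
    return all(target.name.strip() for target in finding.targets)


# Example: the sentence "Protein A phosphorylates Protein B" as a structured finding.
finding = StructuredFinding(
    process="phosphorylates",
    effector=Entity(name="Protein A", kind="protein"),
    targets=[Entity(name="Protein B", kind="protein")],
)

# A finding with no target object would be rejected before storage.
incomplete = StructuredFinding(process="phosphorylates", targets=[])
```

In this sketch, a finding that passes is_properly_formatted would be handed on to storage, while one that fails would be returned for correction, mirroring the verifying step of claim 4 in spirit only.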